Fault Tolerance in Message Passing Interface Programs

Authors

  • William Gropp
  • Ewing L. Lusk
Abstract

In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.

Similar resources

Fault Tolerance in MPI Programs

This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, ...

In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

Today, the scale of high-performance computing (HPC) systems is much larger than ever, which poses a challenge for the fault tolerance of HPC systems. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this pape...

Message Relaying Techniques for Computational Grids and their Relations to Fault Tolerant Message Passing for the Grid

To execute message-passing distributed applications on a computational grid without modification, one has to address many issues. The first is how to let processes of two different clusters communicate. In this work, we study the performance of relaying techniques (passing messages to a middle tier) to solve this issue. When using relays, messages and most of the nondeterministic...

Can Agent Intelligence be used to Achieve Fault Tolerant Parallel Computing Systems?

The work reported in this paper aims to validate an alternative approach to fault tolerance, in place of traditional methods such as checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, “What agent capabilities are required for fault tolerance?”, “What parallel computational tasks can benefit f...

Journal title:
  • IJHPCA

Volume 18

Publication date: 2004